Submitted to Publ. Astro. Soc. Pacific, January 2013 
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ABSTRACT 

VOStat is a Web service providing interactive statistical analysis of astronomical 
tabular datasets. It is integrated into the suite of analysis and visualization tools asso- 
ciated with the international Virtual Observatory (VO) through the SAMP communica- 
tion system. A user supplies VOStat with a dataset extracted from the VO, or otherwise 
acquired, and chooses among ~ 60 statistical functions. These include data transforma- 
tions, plots and summaries, density estimation, one- and two-sample hypothesis tests, 
global and local regressions, multivariate analysis and clustering, spatial analysis, direc- 
tional statistics, survival analysis (for censored data like upper limits), and time series 
analysis. The statistical operations are performed using the public domain R statistical 
software environment, including a small fraction of its > 4000 CRAN add-on pack- 
ages. The purpose of VOStat is to facilitate a wider range of statistical analyses than 
are commonly used in astronomy, and to promote use of more advanced methodology 
in R and CRAN. 

Subject headings: Data Analysis and Techniques 



1. Introduction 

Astronomy has always accounted for the production of a significant amount of scientific data. 
With recent developments in instrumentation, this amount surpassed most fields of science. Most 
of the astronomers, however, are still using traditional techniques to analyze the available data. 
While this approach has definitely produced many good results, it is expected to be even more 
productive if it is coupled with statistical techniques: the time-tested science of data analysis. 
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Both astronomy and statistics are well developed branches of science, each with its own set of 
journals, conferences, degrees and jargons that require time and patience to master. So desirable 
though it is, simultaneous proficiency in both the fields not commonly achieved. 

VOStat0 is a web service that is designed to help in this endeavor. Oriented towards tabular 
data, rather than images or spectra, it is an effort to bring statistics to the astronomer's desktop. 
There seems to be two ways to introduce statistics to an astronomer. The first is to ask her to read 
and digest a selection of standard books on statistics (most of which are written with social scientists 
in mind), and then expect her to apply the ideas to problems in her own researc h. Statistics 



texts designed for the physical scientist are n ow emerging to assist in this process (| Jamed 120061 ; 



Wall &: Jenkins! [2012J; iFeigelson fc Babul |2012| ) . The second approach is to offer her a convenient 
Web service with a menu of statistical methods to choose from. Data can be uploaded from her local 
computer or acquired from the Virtual Observatory, a worldwide federation of public astronomical 
databases and associated analysis services. She can pick methods and immediately apply them to 
her data without having to download software or read more than a short non-technical description. 
If she likes the result, she may then delve deeper into the methodology and acquire the underlying 
software to investigate the initial findings more carefully on her machine. 

VOStat takes the latter approach. It uses R, the largest free, public domain statistical software 
to perform its computations. The output of VOStat is not merely the output of R, but also the R 
scripts (generated automatically by VOStat based on the user's specification). Indeed, the main goal 
is not to encourage astronomers to outsource their statistical requirements to VOStat, but to teach 
them how to be able to carry out independent statistical analysis of their data themselves. VOStat 
is available to anyone on the World Wide Web, but is particularly oriented towards astronomers 
drawing upon Virtual Observatory datasets and tools. 



2. The R Statistical Software Environment 

For most of the computer age, the primary software implementing statistical analyses were 
commercial packages. The Statistical Analysis Syste providing a vast array of built-in proce- 
dures in a Fortran-like programming language, is the most comprehensive, serving ind ustries and 



gover nments. During the 1980s, John Cambers and colleagues at AT&T developed S ([Chambers 
Il998l ). a statistical programming language written in C, that evolved into the commercial package 
In the 1990s, New Zealand statisticians Ross Ihaka and Robert Gentleman rewrote S in 
the public domain, calling it R, and released it under GNU General Public License software prod- 
uct under the auspices of a non-profit R Foundation for Statistical Computing led by a growing R 



1 VOStat is available at http://vostat.org 
^http: //w ww. sas . com| 

a http: //spotf ire .tibco . com/products/s-plus/ statistical- analysis- software . aspx 
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Development Core Team (jlhaka Gentlemanlll996l ) 



In the spirit of open source software, the R team set up the Comprehensive R Archive Net- 
work (CRAN ), a structure for external experts in statistical computing to contribute specialized 
packages. About 20 of the early packages were incorporated into the base R package, which is now 
a reasonably stable product. But CRAN contributions continued to enter the system, growing 
exponentially since 2001. Today there are > 4000 CRAN packages with a new package entering 
every day. They cover every field of applied statistics with particularly strong contributions from 
the biology (particularly genomics) and econometrics communities. The R/ CRAN system now 
has > 60, 000 statistical functionalities, some simple and others very complex. CRAN packages 
are easily retrieved on-the-fly during an R session. 

As there is no global index to CRAN , it can be tricky to find the appropriate functionality 
for the task at hand. The CRAN Task Views__ provide brief but useful summaries of packages 
in ~ 30 broad topics. These include Bayesian inference, chemometrics and computational physics, 
cluster analysis and finite mixture models, graphic displays and visualization, high performance 
computing, machine learning, medical imaging, multivariate analysis, optimization and mathemat- 
ical programming, robust statistical methods, spatial data, survival ana lysis (treating upper limits) 



and t ime series analysis. Publications using R/ CRAN should cite |R Development Core Team 



(|2012l ) and the reference given by the function citation{package} . 



The capabilities of VOStat are similar in scope to those provided by other R graphical user 
interfaces (GUIs) like Deducer and is less extensive than some like R Commanded The purpose for 
creating a new R GUI for astronomy is interoperability with other Virtual Observatory capabilities 
through its Simple Application Messinging Protocol (SAMP), so the astronomer experiences a 
flexible and capable environment for interactive scientific analysis of tabular data_§ 



3. Overview of VOStat 

The international Virtual Observatory (VO) system federates the data resources of various 
astronomical facilities around the world. It provides tools for the extraction, visualization, and 
comparison of data extracted from VO databases, as well as some specialized astronomical analyses 
such as construction and fitting spectral energy distributions and lightcurves for variables objects. 
This system provides easy web-access to a vast amount of astronomical data, and allows astronomers 
to communicate findings easily for multi-observatory studies. Its impressive scope notwithstanding, 



4 http: //cran.r-project . org/web/ views 

5 See http://www.sciviews.org/_rgui for details on ~ 25 GUIs developed for R. 

6 Virtual Observatory science software tools and services with SAMP connectivity are available 
at http : //www.usvao . org/science- tools- services/ , http: //www. euro- vo . org/pub/f c/sof tware .html , and 
http: //wiki . ivoa.net/twiki/bin/view/IVOA/SampSoftware 
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the VO system lacks comprehensive statistical tools suitable for analyzing the data extracted by a 
user. 

The VOStat project addresses this deficiency with two goals: it provides a statistical toolbox 
well integrated into the VO system to enable extracted astronomical data to be analyzed effi- 
ciently and effectively; and it spreads awareness among practicing astronomers about the need of 
complementing domain knowledge with statistically valid data analysis techniques. VOStat seeks 
to achieve these aims with a web-based astrostatistical computing portal that links VO-extracted 
datasets to the statistical analysis capabilities of R, the largest free, open-source statistical com- 
puting environment. Automated connectivity between these software environments is provided by 
the VO's Simple Application Management Proocol (SAMP), a VO astronomer can (for example) 
interactively acquire tabular datasets from the VO Data Discovery Tool, view and manipulate the 
tables using TOPCAT or VOTable, and perform statistical analyses with graphics using VOStat. 
While the analyses provided by VOStat are not highly complex due to the limitations of the Web 
services interface, R code is provided giving a springboard for the astronomer to develop and imple- 
ment more advanced statistical methodology geared towards the particular astronomical research 
problem at hand. 

VOStat is accessed by the public at www.vostat .org. Astronomers can upload their data into 
the VOStat server via a simple Graphical User Interface that operates the chosen R functions on 
the user-provided dataset. The data files can be on the user's home computer, on a Web site, 
or obtained from VO-compliant tools via the VO's SAMP. VOStat thus can interact with human 
users directly (e.g., accepting data, displaying plots on screen), with VO-compliant tools via a 
SAMP hub, a specialized server running in the local machine of the user that uses SAMP as its 
communication protocol. 

Once a dataset is extracted from the VO or otherwise provided, the user chooses from a list of 
several dozen statistical techniques, the selected techniques are applied to the data, and the results 
are returned to the user. Apart from traditional output like on-screen plots and tables, VOStat 
also provides the following to understand and extend the statistical analysis of their dataset on 
their home computer: 

1. annotated R code for each step of the statistical analysis; 

2. workspace file giving internal R files of statistical analysis results; 

3. help files from R documentation and from Wikipedia describing the statistical functions; 

4. guidance for the statistically uninitiated astronomer to select the right analysis. 

VOStat is oriented towards analysis of tabular data. Commonly, the rows of the table repre- 
sent a collection of astronomical objects and the columns represent measured or inferred properties 
of the objects. But astronomical lightcurves or spectra can be analyzed, where the first column 
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gives observed times or wavelengths and the other columns give observed values. R has exten- 
sive capabilities for analysis of images (see the CRAN Task View on medical image analysis, 
|http:/ /cran.r-project.org/web/vi ews/Medicallmaging.html), but these are not incorporated into 
VOStat. Therefore, we do not encourage analysis of astronomical images using VOStat. 

A common concern among astronomers while choosing a software is the upper bound of data 
size that can be handled. VOStat imposes no further restriction on this beyond the limitation 
of R itself. Thanks to the way R is built from diverse packages created by unrelated groups of 
developers, it is difficult to quantify the data size limitations of R in general. Certain methods 
allow quite large data sets, while others do not. The primary design goal behind VOStat is to 
motivate astronomers to use R. So users are encouraged to play with the available methods using 
moderately sized subsets of their data, and then download and tweak the generated script to work 
with the entire data sets. VOStat has facilities built into it to choose a subset of data. 

Two earlier versions of VOStat were developed, the first produced by a U.S. group involving 
Penn State and Caltech, and another undertaken by VO-India and Penn State with contributions 
from Caltech and Calcutta University Statistics Department. The latter is provided for use by 
VO-Indifl These versions of VOStat did not have SAMP communication with other VO tools 
and provided a narrower suite of statistical methods. Other tools for interfacing R to the Web for 
general statistical analyses without association with the Virtual Observatory include FastRWeb, 
R2HTML, Rcgi, Rweb and Rwu@. 

4. Internal structure of VOStat 

VOStat uses Java as the interface between R and the user. The Apache Tomcat web server 
deploys the Java servlets that comprise the web interface. An R session is then spawned on the 
server computer to implement the chosen statistical procedures. Figure Q] shows the workings of 
VOStat, and the three principal modules are described here. 

4.1. Data loader module 

VOStat allows data to be fed into it from different sources: 

1. Uploading a file from the user's local machine. 

2. Uploading a file from a URL on the Internet. 



'http: //vo . iucaa. ernet . in/$\sim$voi/VOStat .html 

s http: //www. rforge .net /FastRWeb , http : // cran. r-pro j ect . org/web/packages/R2HTML 

: //www.ms .uky . edu/$\sim$statweb , http : / /www. math. mont ana. edu/Rweb , http : //sysbio .mrc-bsu. cam. ac .uk/Rwui 
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3. Getting data from VO-compatible tools via a SAMP hub. These tools include Aladin Sky 
Atlas, Datascope, Octet, Open SkyQuery, Topcat, VIM, and VisIVO. 

We rely on Java applet technology for the SAMP communication, and on multipart file up- 
loading for accessing a file from the user's local disk. The data file may be in different formats 
commonly in use among astronomers: VOTable, FITS, or ASCII tab/space/comma separated val- 
ues. VOStat can guess the file format from the file name extension, or the user may specify the file 
format from a drop-down menu. The file extensions recognized by VOStat are .txt, .dat, .csv, .fits, 
.vot and .xml. These formats are read using R utilities. 

For datasets available on the Web, the first page of VOStat has a text field where the user 
can specify the URL of the file to be uploaded. A similar interface allows files to be uploaded from 
the user's hard disk. Obtaining data via SAMP from VO-compliant tools is slightly more involved, 
because the SAMP hub must run in the user's local machine while VOStat resides in a remote 
server. For this we embed a trusted Java applet in the webpage served by the VOStat server. This 
applet contains two clients in it, a SAMP client and an HTTP client. The SAMP client looks 
for and connects to any SAMP hub found running in the local machine. Once a file broadcast 
message (MType: table. load. votable) is received by the SAMP client, the HTTP client establishes 
connection with the remote VOStat server. The file is read by the applet and sent over the internet 
to a servlet running in the VOStat server. This indirect process is required because, during 2011-12 
when VOStat was developed, a SAMP hub can run in the local machine while VOStat tables are 
broadcast via local URLs only. 

4.2. Analysis Module 

This is written entirely in R. For the more commonly used statistical functions, the code uses 
functions in base R. For more specialized functions, relevant CRAN packages are first installed on 
the server machine and the CRAN functions are run on the dataset. In a few cases, we implement 
our own analysis techniques designed for astronomers. The VOStat architecture is created with 
the design goal that new statistical techniques may be added easily without disrupting the existing 
workflow. The analysis tools currently in VOStat are briefly described in £}5j 

4.3. Output Module 

Three types of output from VOStat are provided. First, various plots, estimated parameter 
values, confidence intervals, correlation coefficients, diagnostic messages, and other ASCII results 
are for direct use by the user. Numerical output is usually short and appears on the screen. 
Plots can be downloaded in PDF or Encapsulated PostScript formats. Second, tables meant for 
consumption by other SAMP-aware VO tools are broadcasted through the local SAMP client. 
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Third, the R binary workspace (usually called .Rdata) is available for download so that an analysis 
can be continued seamlessly at a later session. This can assist users who want to continue more 
varied and advanced using R installed in their own machines. 

To further the goal of educating users about statistical tools, VOStat also gives background 
information about the performed analysis. First, annotated R codes can be 'cut and paste' to 
reproduce the analysis on a home computer. The VOStat computation is thus not an opaque 
'black box' but can be reproduced and extended by the scientist. Second, links are provided to 
the R documentation for each important R function used. These 'help files' have a standard 
form defining the function, presenting the inputs and outputs, algorithms, and generic examples 
of their use. Third, links are given to Wikipedia entries to give a more detailed and mathematical 
presentation of the statistical method. References to the methodological literature are given in both 
the R help files and Wikipedia entries. Nearly all of the statistical functions in VOStat are also 
described, with applications to astro nomical datasets, in the textbook Modern Statistical Methods 



in Astronomy with R Applications (IFeigelson fc Babul 12012 ). This text provides 19 astronomical 



datasets that can be used as exercises for VOStat application and more elaborate R code illustrating 
statistical analyses of these datasetg^]. 

For analyses that produce new data worth feeding into other VO-tools (for example, for ad- 
vanced visualization) can be routed via a SAMP hub running in the user's machine. Like other 
web-based VO-tools, VOStat uses trusted applets to connect to a SAMP hub running in the local 
machine. However, unlike some other tools like Aladin Sky Atlas, VOStat does not start a local hub 
itself. It only connects to a hub that is already running. This is to avoid disrupting coordination 
of the local VO-tools in case of a browser failure. 



VOStat statistical functions 



VOStat (Rev. 2, December 2012) currently provides over 50 statistical functionalities organized 
into 11 topics. We review these briefly here. R and CRAN functions are given in brackets and 
italics, respectively. Modules that produce output that can be automatically transferred to another 
VO-compatible client via SAMP are labeled 'SAMP'. Figures of selected graphical outputs were 
generated in a separate session of R from the VOStat R codes in order to cosmetically improve 
them for publication (e.g., axis labels). Intermediate -level descriptions of thes e statistical functions 
with applications to astronomical data are given by lFeigelson &: Babul (120121 ) . 



"http : //astrostatistics .psu. edu/MSMA/datasets and http: //astrostatistics .psu. edu/MSMA/MSMA_R_scripts .html 



- 8 - 



5.1. Transforming and subsetting the data 

Data summary This gives minimum, quartiles (25%, 50%, 75%), and maximum values for 
each variable, [summary] 

Select cases Rows can be selected using operations <, <, >, and > applied to specified 
columns. 

Simple transform Variables can be transformed with log 10 x, ^/x, tyx, x 2 , x 3 , 1/x and 
standardization (centering and scaling). 



5.2. Plots and summaries 



Boxplot Boxplots show th e minimum, maximum, median and quartiles of a univariate 
distribution with tails and outliers (lTukeyill977l ). [boxplot] 



Dotplot Dotplots show univariate values of individual objects with labels. This is appro- 
priate (in terms of visual clarity) only for small datasets. [dotchart] 

Histogram Histograms show the univariate distribution grouped into bins. The user can 
specify the number of bins or the bin boundaries. If none are given, then the number of bins use s 
the Preedman-Diaconis algorithm based on the inter-quartile range ( Freedman &: DiaconislE98ll ). 

[hist] 

Averaged shifted histogram Averaged shifted histograms are univariate density estima- 
tors (smoothers) that repeate dly bins the data with shifted origins and convolved with a quartic 
(biweight) kernel (IScottill992l). The user sp ecifies the number of bins used. This is implemented in 
the CRAN ash package (jScott et al.ll2012l ). An example is shown in Figure [2b. [as/ii] 



Scatterplot This is the common plot of individual data points in two variables specified 
by the user. Numerous options in formatting this plot (and other VOStat graphical outputs) are 
available to the user who adjusts the plot parameters in R on their home computer, [plot] 

Pairs plot A pairs plot is a matrix of scatterplots for a multivariate dataset, sometimes 
called a SPLOM (for 'Scatter PLOt Matrix'). It is often valuable for examining patterns and 
locating problems in multivariate data. In practice, a limited number of variables (say < 10) can 
be shown in a single plot. An example is shown in Figure [2U. [pairs] 

3D scatterplot This is a plot of three y ariables specified by the user implemented in CRAN 
package scatterplotSd (ILigges fc Machlerl 120031 ) . [scatterplotSa] 



Summary statistics This gives mean, median, variance, median absolute deviation for each 
variable specified by the user. It also shows a matrix of Kendall's r nonparametric coefficient of 
bivariate correlation with associated probability values obtained with CRAN package SuppDists 
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(jWheeleii 120091 ) . [mean, median, var, mad, cor, pKendal\ 



5.3. Estimating distributions from data 



Histogram and Averaged shifted histograms See §5.21 

Empirical distribution function This provides a sequence of plots of the normalized 
cumulative step-function of a single variable with 95 % confidence bands b ased on the Kolmogorov- 
Smirnov statistic using the sfsmisc CRAN package ([Machler et al.ll201ll ). [ecdf.ksCI\ 



Kernel density estimation: 1-dimensional, fixed window width This function con- 
volves the data with a Gaussian function. Here a constant bandwidth selected by the user is used; if 
no bandwidth is specified, a default optimized to a Gaussian distribution for the data, h = (^) 1 ^ 5 c 
where n is the number of data points and a is the sample standard deviation. A rug plot of the 
input data and 95% confi dence bands around the smooth estimator are shown. This is produced 
with CRAN package sm ([Bowman &: Azzalinilll997l . I201Q ). [sm. density, plot, lines, rug] SAMP 



Kernel density estimation: 1-dimensional, adaptive window width This is a so- 
phisticated robust L-estimator that applies a Gaussian kernel with bandwidth adapted to the local 
concentration of da ta points implemented in the CRAN package quantreg (jPortnov &; Koenker 
19891 ; lKoenkeril2012l ). The user can adjust a parameter that determines the sensitivity of the local 
bandwidth to local variations, [akj, seq, plot, rug] SAMP 

Kernel density estimation: 2-dimensional, fixed window width This function con- 
volves the data with a Gaussian function of fixe d bandwidth and produces a colorful 3-dimensional 
surface plot that can be interactively rotated ([Bowman fc Azzalinil 119971 ). Confidence bands are 
available but not plotted, [sm. density] SAMP 

Fitting distributions to data This performs a maximum likelihood fit of a statistical 
distribution to a univariate dataset. Fifteen distribution are provided: beta, cauchy, chi-squared, 
exponential, F, gamma, geometric, log-normal, logistic, negative binomial, normal, Pareto (power 
law), Poisson, t and Weibull. The best-fit parameters are obtained numerically except for the 
normal, log-normal, geometric, exponential and Poisson distributions where analytic solutions are 
used. Output includes best fit parameters, their estimated standard errors fro m the Fisher infor- 
mati on matrix, and the log-likelihood. The R implementation is described by ([Venables &: Ripley 
2002). The only exception is Pareto distribution, for which VOStat uses its own code, [fitdistr] 



5.4. Comparing data with distributions 



Normal plot This function provides a quantile-quantile (Q-Q) plot of a univariate dataset 
compared with the normal (Gaussian) distribution. If the distribution agrees well with a normal, 
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then a variety of classical statistics can be applied, [qqnorm, qqline] 

Tests for normality Several hypothesis tests are provided to test whether a univariate 
dataset is consistent with a normal (Gaussian) distribution. These include the Shapiro- Wilk test 
in base R, and the Lilliefors (specialized Kolmogorov-Smornov), Cramer- von Mises, Anderson- 
Darling, Pearson (chi-square) , and Grubbs tests from CRAN package nortest. [shapiro.test, lil- 
lie.test, cvm.test, ad. test, pearson.test, sf.test] 

One-sample KS test A selected univariate sample is compared to a uniform, exponential 
or normal distribution (with user-specified parameters). The two cumulative distribution functions 
are shown with the differences, and the Kolmogorov-Smirnov 1-sample test is applied, [ks.test, seq, 
punif, pexp, pnorm, ecdj\ 

Anderson-Darling test As in the previous function, a univariate sample is tested against 
a uniform, exponential or normali distribution using the Anderson-Darling test. This is more 
sensitive than the Kolmogorov-Smirnov test for multiple small-scale deviations and differences at 
the edges of the distribution. The function, implement ed in CRA N package adk, permits A;-sample 



comparisons, but this is not implemented in VOStat (|Scholal2008l ). [adk. test] 



Two-sample KS test The distributions of two variables in a multivariate dataset are 
compared using the two-sample Kolmogorov-Smirnov test. The user selects whether a one-sided or 
two-sided test is desired, [ks.test] 

Chi-square two-sample test Two univariate samples of grouped (binned) count data are 
combined into a contingency table, and their consistency is estimated using Pearson's chi-squared 
test. The two samples must have the same number of bins, [table, cbind, chisq.test] 



5.5. Some standard tests 

One-sample Z test This is a hypothesis test that compares the mean of a chosen univariate 
dataset with a user-specified value. It applies when the dataset is approximately normal (Gaussian) 
distribution, the variance is known, and the sample is sufficiently large, [nrow, pnorm] 

One-sample t test This is a similar hypothesis test for the mean but does not require prior 
specification of the variance, [t.test] 

Paired t test This is a similar hypothesis that the means of two univariate samples have 
the same mean (or with a specified offset). Note the assumption of Gaussianity. [t.test] 
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5.6. Regression 



Linear regression Multivariate linear regression is performed us ing R's important Im 
(linear modeling) function. Its use is described in detail by ISheatheii (]2009i ). The user specifies a 
single response variable (optionally weighted by measurement errors) and one or more independent 
variables (without measurement errors). The fitted regression coefficients and their confidence 
intervals are produced, along with diagnostic residual plots to assist in evaluating the quality of 
the fit. (Im, length, for, plot) SAMP 

Nadaraya- Watson local regression This is a nonparametric bivariate kernel regression 
technique to estimate conditional expectation of the random variable Y on X. It can find nonlinear 
relationships of unknown functional form. A constant bandwidth for a Gaussian kernel derived by 
Nadaraya and Watson in 1964 is used. The resulting plot shows the bivariate scatter plot, local 
regression curve, and 95% confidence intervals obtained from bootstrap resa mples. An example 
is sh own in Figure Et. The algorithm is implemented in CRAN package np (jHayfield Racine 
20081 ). [npregbw, npplot, points] 



Major axis/orthogonal regression While ordinary linear regression requires specification 
of a 'response' or dependent variable, astronomers sometimes seek intrinsic relationships that treat 
random variables in a symmetrical fashion. VOStat implements t hree symmetrica l bivariate least- 
squares linear regression fits using the CRAN package l model2 (|Le gendral 120081. Astronomical 
applications of symmetric regression lines are described in iFeigelson &: Babul (11992 ) . 



LOESS regression A popular method for local polynomial regression fitting based o n 
weighted least squares fitting of linear functions locally around each data point ([Cleveland! Il994h . 
A local parabola is used here, and the user specifies a smoothing bandwidth parameter, [loess, 
SAMP, plot, order, lines] SAMP 

Linear quantile regression Quantile regression is a sophisticated method that fits, for 
sufficiently large datasets, linear and nonlinear relationship s to the median or other quantile of the 
response variable conditioned on the independent variable (jKoenkerl 120051 ) . This is valuable if the 
scatter about the relationship is asymmetrica l or otherwise non-Gaussian. The calculation is made 
by CRAN package quantreg (|Koenkeiil2012l ). 



Robust regression using an M-estimator A method of iterative least square s multiyariat e 



regression where the influence of outliers are down-weighted using Huber's tp function (jHuber 



1981 



The VOStat implementation is restricted to linear fits, but a much wider range of functions is 
permitted in R. [rim, plot] SAMP 

Robust regression using trimmed least squares Another linear robust reg ression 
method that removes, rather than downweights, outlying points (jRousseeuw &: Leroyl 120031 ) . Here 
the the sum of the quantile smallest squared residuals is minimized. [Itsreg, plot] SAMP 
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5.7. Multivariate methods 

Pairs plot See gOJ 
Linear regression See §5.61 

Principal component analysis An important procedure to find the most important linear 
combinations of variables that account for the variance in the dataset. [princomp, na.omit, plot] 
SAMP 

Canonical correlation analysis A multivariate procedure that finds two subsets of vari- 
ables whose linear combinations have maximum correlation with each other, [cancor, no. omit] 

Model based clustering This is an important method to find structures in multivariate 
data assuming they follow multivariate normal (Gaussian) distributions. These can be shaped like 
spheres, ellipsoids, panca kes or cigars in p-dimensions. This is called a 'normal mixture model' 



(jMcLachlan fc Peell 12000). Parameters are obtained by maximum likelihood estimation using the 



EM Algorithm with model selection (i.e., choice of number of clusters) based on the Bayesia n 



Information Criterion. This well-known code is in CRAN package mclust (IFraley Sz Rafteryll2006l ). 
The VOStat code produces a plot of BIC vs. number of clusters, a pairs plot with symbols reflecting 
cluster membership, and tabular output giving the multivariate means and variances of each cluster. 
[mclutBIC, plot, summary] SAMP 

Hierarchical clustering Nonparametric hierarchical clustering is based on agglomerating 
proximate data points into increasingly larger groupings using a chosen metric and cluster criterion. 
VOStat assumes a Euclidean metric, so standardization of variables may be desirable. Clustering 
criteria include: Ward's, single linkage, average linkage, complete linkage, McQuitty's criterion, 
median and centroid. Note that 'single linkage' clustering is commonly called the 'friends-of- friends' 
algorithm i n astronomy, but is not recommended due to its propensity to 'chain' together distinct 



structures (jEveritt et al.ll201ll ). [dist, hclust, plclust, cophenetic, cor, cutree] SAMP 



5.8. Spatial methods 

Variogram A variogram is a function measuring the spatial dependence of a random field 
that is sampled at data points in two dimensions. Three variables are involved: the variable whose 
spatial pattern is of interest, and two variables representing location. The plot shown here gives 
the variance of the difference between the intensity of the process at two locations separated by 
a distance in the spatial plane. The calculation assumes the point process is stationary (same 
behavior at all locations) and isotropic. The plots of temperature fluctuations as a function of 
angular scale for the cosmic microwave background, and of velocity dispersion as a function of 
physical scale in molecular clo uds are astrono mical examples of variograms. It is implemented here 



in the CRAN package gstat (jPebesmal 12004 ). [variogram, plot] 
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fc-nearest neighbors and Moran's / test The fc-nearest neighbors method useful for 
forming an idea about the topology of a point process. Each point is connected to its k nearest 
neighbors. Moran's test can check for spatial autocorrelation of a variable with respect to this 
topology. It gives the probability that the autocorrelation is consistent with complete spatial 
randomness, [spdep, RANN\ 

Ripley's K and related measures Astronomers commonly examine the 2-point correlation 
function (known in statistics as the 'pair correlation function') for clustering of a stationary point 
process, showing the strength of spatial clustering as a function of distance. Statisticians commonly 
use the integral of the 2-point correlation function to avoid arbitrary binning; this is known as 
Ripley's K function. The plot provided by VOStat shows the K function with two corrections 
of edge effects, and a theoretical curve assuming complete spatial randomness. Also shown are 
Besag's L (the K function rescaled to be horizontal in the null case of random distributions), 
Baddeley's J (a different function that is less sensitive to edge effects), and the pair co rrelation 



function. Computations are performed with the CRAN package spatstat (IBaddeleyl 12005). A pair 



correlation function is shown in Figure [2b. [as. matrix, owin, as.ppp, Kest, Lest, Jest, pcf, plot] 

Voronoi tessellation and adaptive 2D kernel density estimation The Voronoi (or 
Dirichlet) tessellation is a division of the space into distinct regions around each data point defined 
such that each tile contains the locations closer to that point than any other point. The result is the 
division of the space into polygons around each point. The inverse of the polygonal area is a measure 
of the local density of points, and is used to create a nonparametric adaptive smoother of the 
distribution of points. VOStat provides both a 2-dimensional grey-scale map and a 3-dimensional 
perspective plot of the smoothed distribution that can be interactively rotated. These operations are 
computed with spatstat. A Voronoi tessellation and its associated adaptively smoothed distribution 
are shown in Figure [2f-g. [as. matrix, owin, dirichelet, adaptive, density, plot, surfSd] 



5.9. Directional statistics 

Plotting directional data A circular plot is done to provide an idea about the general 
shape of the data. It is an anlog of barplot or histogram for directional data. A directional data 
plot is shown in Figure [2b,. 

Summarizing directional data Since directional data deal with angles, one has to be 
careful about 2tt radians wrapping back to 0. VOStat computes the mean direction as well as the 
mean resultant length, [circular] 

Kernel density estimation This tools computes kernel density estimate from directional 
data, [circular] SAMP 

Fitting a von Mises distribution The von Mises family of distributions is like an analog 
of the Gaussian family for directional data. Given a circular data set this tool performs maximum 
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likelihood fitting of a von Mises distribution to it. [CircStats] 

Testing goodness-of-fit with a von Mises distribution Usually fitting a distribution 
to a data is not enough until we know if the fit is adequate. Watson's test checks goodness of fit of 
a given data set with a fitted von Mises distribution. It requires bootstrapping. [CircStats] 

Testing goodness-of-fit with a uniform distribution Same as above but with uniform 
distribution in place of von Mises distribution. Four different tests are performed: Raleigh test, 
Watson test, Kuiper test and Rao's spacing test. [CircStats] 



5.10. Survival analysis 

Maximum likelihood Kaplan-Meier estimate of the survival curve for censored data 

Astronomers often observe objects without detection, giving rise to upper limits of the observed 
property. In statistics, these limits are called left-censored data points, and are treated with 
methods from 'survival analysis'. The Kaplan-Meier estimator is the unique maximum likelihood 
estimator of a randomly censored univariate sample. Confidence intervals are based on Greenwood's 
formula. A VOStat plot of a Kaplan- Meier estimator is shown in Figure [3j [Surv, surfit, plot, lines] 

Tests for censored data Two-sample tests to compare different univariate samples with 
censoring. Generalizations of the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling 
tests are applied. Here, censoring patterns need not be random or the same in the two samples. 
[survdiff] 

Maximum likelihood Lynden-Bell-Woodroofe estimator While censoring occurs when 
known objects are not detected in some property, truncation occurs when an unknown number of 
unknown objects are not detected due to sensitivity limitations. Analogous to the better-known 
Kaplan-Meier distribution for censoring, the Lynden-Bell-Woodroofe estimato r gives the unique 



maximum likelih ood estimator for a randomly truncated univariate variable (lLynden-Belll 11971 



Woodroofd[l985h . Useful for estimation of luminosity functions in flux-limited astronomical surveys 



its mathematical properties are o ften better tha n heuristic functions like the widely-used 1/V r 



estimator with arbitrary binning (|Schmidtlll968l ). 



5.11. Time series analysis 

Nearly all time series analysis in R and CRAN requires regularly spaced data without gaps. 
These methods are thus inappropriate for many irregularly spaced astronomical datasets. 

Time series plotting with smoothing This VOStat function plots a univariate time 
series with Gaussian kernel smoothing based on user-provided bandwidth. A time series plot is 
shown in Figure H^,. [plot.ts, ksmooth, lines] 
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(Partial) Autocorrelation Function (ACF and PACF) Computation and plot of the 
autocorrelation function of the univariate time series; this is the correlation as a function of lag 
time. The partial autocorrelation function removes the effects of intermediate lags. A PACF plot 
is shown in Figure 0b. [acf, pacj\ 

Autoregressive modeling This is a maximum likelihood fit of the time series to a stochastic 
autoregressive model, where the order of the model is selected by the maximum of the Akaike 
Information Criterion (AIC), a penalized likelihood measure. VOStat provides plots of AIC against 
order, and shows the periodogram of the resulting best-fit model, [ar, plot] 

Periodogram with smoothing option This function provides a Schuster periodogram 
based on a Fourier transform of the time series, with Daniell (boxcar) smoothing. The time series 
is first detrended by a linear function, a 10% taper is applied, and a 95% confidence interval 
assuming white noise is shown. A periodogram of an astronomical time series is shown in Figured!:. 
[spec.pgram] 



5.12. Conclusion 

While many resources are applied to the reduction of data as it arrives from telescopes, VOStat 
is one of the relatively few resources available to assist the astronomer in the later stages of scientific 
analysis, such as the discovery of patterns in the observations and compa rison of observed properties 



of celestial objects with astrophysical theory. As eloquently phrased by iGregoryi (120051 ): 



The goal of science is to unlock natures secrets. ... Our understanding comes through 
the development of theoretical models which are capable of explaining the existing obser- 
vations as well as making testable predictions. ... Fortunately, a variety of sophisticated 
mathematical and computational approaches have been developed to help us through 
this interface, these go under the general heading of statistical inference. 

VOStat provides a user-friendly, interactive, Web-based interface that applies several dozen 
statistical operations to a user-supplied dataset. A unique difference from other statistical services 
is its integration with other interactive tools of the Virtual Observatory using the VO's SAMP 
communication system between applications. Using VO tools, an astronomer can extract carefully 
chosen datasets from huge, widely distributed, multiwavelength databases, analyzing and enhancing 
the data in various ways within VOStat. Some results from VOStat analysis can then be broadcast 
back to other VO applications through SAMP for visualization and further analysis. 

The engine behind VOStat is the remarkable R statistical software system, the most com- 
prehensive available public domain data analysis environment. We emphasize that VOStat taps 
only a small fraction of R, and a tiny (< 1%) fraction of the specialized CRAN packages. Thus, 
we strongly encourage VOStat users to extend the on-line analysis with a more thorough interac- 
tive analysis on their home computer using R and CRAN. Educational materials for learning R 
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are available at http://www.r-project.org and in many books, including one designed specially for 



astronomy (jFeigelson &: Babull2012l ) 



This work is supported by the NSF 'SI2-SSE' grant AST-1047586 (PI - G. J. Babu). 
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Fig. 2. — Selected VOStat plots using 

properties of Galactic globular clusters (jWebbink 



http: / / astrostatistics.psu.edu/MSMA /datasets/GlobClus_prop.dat| 



a mu ltivar iate dataset of 

1985I ) available at 
Plot formats have been 



adjusted in R for publication, (a) Directional plot of Galactic longitudes, (b) Averaged shifted 
histogram of cluster Galactic longitudes, (c) Nadaraya- Watson local regression of globular cluster 
core radius (in pc) on the log of the central stellar density with 95% bootstrap confidence intervals, 
(d) Pairs plot showing bivariate relationships between four properties of globular clusters (visual 
magnitude, core radius, concentration and log central star density), (e) Pair (2-point) correlation 
function of globular cluster Galactic longitudes and latitudes (black), with an edge correction 
(red) and a theoretical line assuming random locations (green), (f) Voronoi tessellation of globular 
cluster visual magnitude and log central star density, (g) Adaptive smoothed image based on 
Voronoi tessellation. 
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Fig. 3. — VOStat plot for censored data (upper limits). Kaplan-Meier estim ator (with 95% confi- 
dence bands) for lithium abundances of stars studied by I Santos et al.l (|2002l ). Thirty-eight values 
are de tected and thirty are upper limits. Data and analysis are described in iFeigelson &; Babu 
(|2012h . 
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Fig. 4. — VOS tat plots for time series analysis of counts from the 'quasi-periodic oscillator' X-ray 
source GX 5-1 (|Norris et al.lll99oh . (a) Time series plot with smooth ng (bandwidth = 200 bins), (b) 
Partial autocorrelation function showing complex structure, with 95% confidence intervals assuming 
white noise, (c) Fourier periodogram with detrending, tapering and smoothing, with 9 5 % co nfidence 
interval assuming white noise. Data and analysis are described in lFeigelson &; Babul (|2012l ). 



